Abstract: Anomaly detection is an important task in many
real-world applications such as fraud detection, suspicious activity
detection, and health care monitoring. In this paper, we tackle this
problem from a supervised learning perspective in the online learning
setting. We maximize the well-known Gmean metric for class-imbalance
learning in an online learning framework. Specifically,
we show that maximizing Gmean is equivalent to minimizing a
convex surrogate loss function, and based on that we propose a
novel online learning algorithm for anomaly detection. We then
show, by extensive experiments, that the performance of the
proposed algorithm with respect to the sum metric is as good as that
of the recently proposed Cost-Sensitive Online Classification (CSOC)
algorithm for class-imbalance learning over various benchmark
data sets, while keeping running time close to that of the perceptron
algorithm. Another conclusion is that other competitive
online algorithms do not perform consistently over data sets
of varying size. This shows the potential applicability of our
proposed approach.

Index Terms: Class-imbalance learning, online learning,
anomaly detection.

I. Introduction 

Anomaly detection aims to capture behavior in data that does
not conform to the normal behavior expected by a domain
expert [?]. Anomaly detection in the online setting is an important
task in many real-world applications, for example intrusion
detection in computer networks, flight navigation systems, and credit
card fraud detection. Such tasks clearly
require detecting malicious activity on the fly. However,
most existing techniques focus on training the model offline
and then using it to detect anomalies [?, ?, ?, ?].

In this paper, we tackle the anomaly detection problem from a
supervised learning perspective in the online setting. In the supervised
learning framework, anomaly detection refers to correctly
classifying rare class examples as opposed to majority examples.
For example, out of 100 persons visiting a doctor for a cancer
check-up, telling a person who actually has cancer that they
do not have it is more costly than telling a person that they have
cancer when they actually do not. Therefore, we treat the anomaly
detection problem as a class-imbalance learning problem. Hence,
from now onwards, we will use class-imbalance learning
to refer to anomaly detection.

In what follows, we present related work in Section II.
Section III is devoted to the problem formulation in the online learning
setting and the online algorithm for class-imbalance learning.
Experimental results are presented in Section IV, and finally
Section V concludes the paper.

II. Related Work 

The work presented in this paper spans two main themes in
data mining and machine learning: online learning and
class-imbalance learning. Although there have been many works in
each domain separately [?, ?], little work has been done that
jointly addresses online learning and class-imbalance learning.
Below we briefly describe the work in each domain that most closely
matches ours.

A. Online learning 

Online learning aims to process one example at a time. First,
the learner receives an example and then makes a prediction. If the
prediction is wrong, it suffers a loss and updates its parameters.
Online learning has its origin in the classic work of Rosenblatt
on the perceptron algorithm [?]. The perceptron algorithm is based on
the idea of a single neuron. It simply takes an input instance
and learns a linear predictor of the form $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$, where
$\mathbf{w}$ is the weight vector and $\mathbf{x}$ is the input instance. If it makes a
wrong prediction, it updates its parameter vector as follows:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + y_t \mathbf{x}_t \tag{1}$$

where $\mathbf{w}_{t+1}$ is the weight vector at time $t + 1$.
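The mistake-driven update in (1) can be sketched in a few lines. This is a minimal illustration rather than code from the paper; the function name `perceptron_step` and the tie-breaking choice sign(0) = +1 are assumptions made here.

```python
def perceptron_step(w, x, y):
    """One perceptron round: predict with w, update only on a mistake.

    w, x are lists of floats of equal length; y is -1 or +1.
    Returns the (possibly updated) weights and the prediction.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = 1 if score >= 0 else -1  # assumption: sign(0) is treated as +1
    if y_hat != y:
        # Equation (1): w_{t+1} = w_t + y_t * x_t
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, y_hat
```

Starting from a zero weight vector, a negative example is first predicted as +1 (because of the tie-breaking assumption), which triggers an update; presenting the same example again then yields a correct prediction and no further change.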

The authors of [?] proposed online learning with kernels. Their algorithm,
called NORMA, is based on regularized empirical risk
minimization, which they solve via regularized stochastic
gradient descent. They also showed empirically how this
can be used in an anomaly detection scenario. However, their
algorithm requires tuning many parameters, which is costly
for time-critical applications. Passive-Aggressive (PA) learning
[?] is another online learning algorithm, based on the idea
of maximizing the “margin” in the online learning framework. The PA
algorithm updates the weight vector whenever the “margin” is
below a certain threshold on the current example. Further,
the same authors introduced the idea of slack variables
to handle data that are not linearly separable.


B. Class-Imbalance Learning 

Class-imbalance learning aims to correctly classify minority
examples. In the literature, existing solutions are
based on either sampling or a weighting scheme. In the
former case, either majority examples are undersampled
or minority examples are oversampled. In the latter case,
each example is weighted differently and the idea is to learn
these weights optimally. Some examples of sampling-based
techniques are SMOTE [?], SMOTEBoost [12], AdaBoost.NC
[?] and so on. Work that uses a weighting scheme includes
cost-sensitive learning [?, ?, ?, ?] and so on.

It is worth mentioning that only a few works
jointly address class-imbalance learning and online learning.
Below, we mention some work that closely matches our
own. In [?], the authors proposed sampling-based online bagging
(SOB) for class-imbalance detection. Their idea is essentially
based on resampling: oversample the minority class and
undersample the majority class from a Poisson distribution with
average arrival rate of N/P and Rp respectively, where P
is the total number of positive examples, N is the total number of
negative examples, and Rp is the recall on positive examples. [?]
also tries to maximize Gmean, but the way they
solve the maximization problem is different from our present
work. [11] proposed online cost-sensitive classification of
imbalanced data. One of their problem formulations is based
on maximizing a weighted sum of sensitivity and specificity,
and the other on minimizing a weighted cost. Their solution is
based on minimizing a convex surrogate loss function (modified
hinge loss) instead of the non-convex 0-1 loss. The work in [11] closely
matches ours. However, in Section III we show the major
difference between the two works.

III. Framework of Class-Imbalance Learning
A. Problem formulation 

Without loss of generality, consider a binary classification
problem. Formally, let $\mathcal{X} \subseteq \mathbb{R}^n$ be the instance space and
$\mathcal{Y} = \{-1, +1\}$ be the label space. We are given samples
$S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_T, y_T)\}$, where each instance-label
pair $(\mathbf{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ and $i \in \{1, 2, \ldots, T\}$. In the online learning
setting, no assumption is made about the distribution of the
samples, and they arrive sequentially. Let $\mathbf{x}_t$ be the instance
received at time step $t$ and $f_t$ be the model obtained
from the previous $t - 1$ rounds. Let $\hat{y}_t$ be the prediction for the
$t$-th instance, i.e., $\hat{y}_t = \mathrm{sign}(f_t(\mathbf{x}_t))$, whereas the value $|f_t(\mathbf{x}_t)|$,
known as the “margin”, is used as the confidence of the learner
at the $t$-th prediction step.

For the binary classification task, let P and N respectively
denote the number of positive and negative instances received
so far. Let TP, TN, FP and FN denote the number of true
positives, true negatives, false positives and false negatives so
far. Mathematically, they are defined per time step as
$TP_t = \{\hat{y}_t = y_t = +1\}$, $TN_t = \{\hat{y}_t = y_t = -1\}$,
$FP_t = \{y_t = -1, \hat{y}_t = +1\}$ and $FN_t = \{y_t = +1, \hat{y}_t = -1\}$,
where each braced condition counts as 1 when it holds and 0 otherwise,
and the subscript $t$ denotes the value at that time step. TP is calculated by
summing $TP_t$ from $t = 1$ to $T$. TN, FP and FN
are calculated similarly.


Without loss of generality, assume positive instances are the
minority class. It is well known that using accuracy as
a measure of performance on class-imbalanced data leads to
false conclusions. For example, suppose the training data contain
100 examples, of which 95 are negative and 5 are positive.
If a classifier predicts every example as negative,
it will have an accuracy of 95%, even though it fails to classify
a single positive example correctly.

Therefore, we require a metric that accounts for classifying
the minority class correctly. Gmean [?] is such a metric;
it evaluates the degree of inductive bias in terms of a ratio
of positive accuracy and negative accuracy. Mathematically,
Gmean is defined as:

$$\text{Gmean} = \sqrt{\text{recall}_p \times \text{recall}_n} \tag{2}$$

where $\text{recall}_p$ and $\text{recall}_n$ denote accuracy on positive and
negative examples respectively, defined as:

$$\text{recall}_p = \frac{TP}{TP + FN}, \qquad \text{recall}_n = \frac{TN}{TN + FP}$$
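To make the contrast between accuracy and Gmean concrete, the following sketch evaluates both for a classifier that labels all 100 examples (95 negative, 5 positive) as negative. The function names and the convention of returning 0 for a recall with an empty denominator are assumptions made for this illustration.

```python
import math

def recall(correct, missed):
    """Fraction of one class that was classified correctly."""
    total = correct + missed
    return correct / total if total else 0.0  # assumption: empty class -> 0

def gmean(tp, tn, fp, fn):
    """Geometric mean of recall on positives and recall on negatives."""
    return math.sqrt(recall(tp, fn) * recall(tn, fp))

def accuracy(tp, tn, fp, fn):
    """Plain accuracy over all examples."""
    return (tp + tn) / (tp + tn + fp + fn)

# All-negative classifier on 95 negatives and 5 positives:
# TP = 0, TN = 95, FP = 0, FN = 5.
all_negative = dict(tp=0, tn=95, fp=0, fn=5)
print(accuracy(**all_negative))  # 0.95
print(gmean(**all_negative))     # 0.0
```

Accuracy rewards the classifier for the majority class alone, whereas Gmean drops to zero as soon as one class is missed entirely.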


For simplicity, we assume that $\|\mathbf{x}_t\| \le 1$, which says that
all incoming instances lie within a unit ball. However, this
restriction can be relaxed in a more general setting, that is, we
can take $\|\mathbf{x}_t\| \le r$, where $r$ is the radius of the Euclidean ball
centered at 0. This condition ensures that the cumulative regret,
that is, the performance of the online learner with respect to a fixed
learner chosen in hindsight, can be bounded. Our objective is
to maximize (2).

Suppose we are given a linear classifier of the form $f(\mathbf{x}) =
\mathbf{w}^T \mathbf{x}$, where $\mathbf{w}$ represents the weights of the linear classifier
that we wish to learn in an online manner. Whenever the classifier
makes a mistake, i.e., $y f(\mathbf{x}) < 0$, we update the weights. How the
weights are actually updated is explained in the following
section.

B. Online Setting 

In this subsection, we prove the following lemma, due to
[?], for the sake of completeness. Then we show the hardness
of optimizing the equivalent formulation in the lemma and
propose to optimize a convex surrogate loss function instead.

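Lemma 1: Maximizing Gmean as given in (2) is equivalent
to minimizing the following objective:

$$\sum_{y_t = +1} \frac{N}{P}\, I_{(y_t f_t(\mathbf{x}_t) < 0)} + \sum_{y_t = -1} \frac{P - FN}{P}\, I_{(y_t f_t(\mathbf{x}_t) < 0)} \tag{3}$$

Proof: Maximizing (2) is equivalent to maximizing its
square. So we can write it as follows:

$$\text{Gmean}^2 = \frac{TP}{P} \times \frac{TN}{N} = \frac{P - FN}{P} \times \frac{N - FP}{N} = 1 - \frac{N \cdot FN + (P - FN) \cdot FP}{P N}$$

where in step 2, we used the fact that P = TP + FN and
N = FP + TN. Hence we minimize:

$$\begin{aligned}
\frac{N \cdot FN + (P - FN) \cdot FP}{P N}
 &= \frac{1}{N}\left[\frac{N}{P}\, FN + \frac{P - FN}{P}\, FP\right] \\
 &\to \frac{N}{P}\, FN + \frac{P - FN}{P}\, FP \\
 &= \sum_{y_t = +1} \frac{N}{P}\, I_{(y_t f_t(\mathbf{x}_t) < 0)} + \sum_{y_t = -1} \frac{P - FN}{P}\, I_{(y_t f_t(\mathbf{x}_t) < 0)}
\end{aligned}$$

where in step 2, we have taken off the constant N from
the equation without changing the optima, and $I_{(C)}$ is an
indicator function that outputs 1 when its argument C is true and
0 otherwise. □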
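The algebra behind Lemma 1 can be checked numerically. Using P = TP + FN and N = TN + FP, the squared Gmean equals 1 minus the weighted error count (N/P)·FN + ((P − FN)/P)·FP divided by N. The sketch below verifies this identity exactly with rational arithmetic for every admissible (FN, FP) pair; the function names are chosen here for illustration and do not come from the paper.

```python
from fractions import Fraction

def gmean_squared(P, N, FN, FP):
    """Squared Gmean, (TP/P) * (TN/N), computed exactly."""
    TP, TN = P - FN, N - FP
    return Fraction(TP, P) * Fraction(TN, N)

def lemma_objective(P, N, FN, FP):
    """Weighted error count (N/P) * FN + ((P - FN)/P) * FP."""
    return Fraction(N, P) * FN + Fraction(P - FN, P) * FP

def identity_holds(P, N):
    """Check Gmean^2 == 1 - objective / N for every (FN, FP) pair."""
    return all(
        gmean_squared(P, N, FN, FP) == 1 - lemma_objective(P, N, FN, FP) / N
        for FN in range(P + 1)
        for FP in range(N + 1)
    )

print(identity_holds(5, 95))  # True
```

Because N is fixed for a given data stream, dividing the objective by N does not move its minimizer, which is why a larger Gmean corresponds to a smaller weighted error count.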

Lemma 1 gives an alternative objective for maximizing Gmean.
However, it involves an indicator function, which is non-convex. Hence,
we resort to convex relaxation techniques. More specifically,
we use the following convex surrogate loss function:

$$\ell(\mathbf{w}; (\mathbf{x}_t, y_t)) = \max\big(0,\ \rho - y_t f_t(\mathbf{x}_t)\big) \tag{5}$$

where^1

$$\rho = \frac{N}{P}\, I_{(y_t = +1)} + \frac{P - FN}{P}\, I_{(y_t = -1)}$$

We see that (5) is similar to the modified hinge loss function.

^1 It can be shown that the problem formulation presented in this paper is similar
to the one presented in [11] but not equivalent, since they maximize a
weighted sum of sensitivity and specificity where the weights are decided by
Laplace estimation. In our formulation, we directly estimate these
weights in terms of P, N, and FN.

Our next goal is to put the modified loss function into the online
learning framework. In online learning, the learner receives one
example at a time and makes its prediction. If the prediction is wrong, it
updates its prediction function $f$. Here we cast this problem
into the empirical risk minimization (ERM) learning framework. In
offline learning under the ERM framework, the measure of quality of
$f$ is the expected risk [?]

$$R_{exp}[f] = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\big[\mathcal{L}(f(\mathbf{x}), y)\big] \tag{6}$$

Since the distribution $\mathcal{D}$ is unknown in general, one instead
minimizes the empirical risk

$$R_{emp}[f, S] = \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(f(\mathbf{x}_t), y_t) \tag{7}$$

One can further regularize the empirical risk to avoid overfitting:

$$R_{emp}[f, S] = \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(f(\mathbf{x}_t), y_t) + \frac{\lambda}{2} \|f\|^2 \tag{8}$$

where $\lambda$ is a regularization parameter that controls the trade-off
between the complexity of the model and the correctness of the
prediction.

Since we are interested in online learning, we define an
instantaneous risk [?] of using the prediction function $f_t$ on
example $(\mathbf{x}_t, y_t)$ as follows:

$$R_{emp}[f_t; (\mathbf{x}_t, y_t)] = \mathcal{L}(f_t(\mathbf{x}_t), y_t) + \frac{\lambda}{2} \|f_t\|^2 \tag{9}$$

Next, we need to determine how our model is performing
over the sequence of inputs during online learning. For that, the
cumulative mistake is often used, defined as:

$$\mathcal{L}_{cum}[f_t, T] = \sum_{t=1}^{T} \mathcal{L}(f_t(\mathbf{x}_t), y_t) \tag{10}$$

We note here that $f_t$ is tested on an example $(\mathbf{x}_t, y_t)$ which
was not used for fitting $f_t$. Thus, if we can guarantee a low
cumulative mistake, we can prevent overfitting without the
need for a regularizer [?]. We use this fact in our current work
and do not provide an explicit regularization parameter.

Thus our objective is to minimize (10). To this end, we use
online gradient descent [?] to optimize it:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \tau \nabla \ell_t(\mathbf{w}_t) \tag{11}$$

where $\ell_t(\mathbf{w}_t) = \ell(\mathbf{w}_t; (\mathbf{x}_t, y_t))$ and $\nabla \ell_t$ is the gradient of
the loss incurred at time $t$ on example $\mathbf{x}_t$. To find
$\nabla \ell_t$ for our loss function (5), we differentiate it with respect
to $\mathbf{w}_t$, and the update rule becomes:

$$\mathbf{w}_{t+1} = \begin{cases} \mathbf{w}_t + \tau y_t \mathbf{x}_t & \text{if } \ell_t(\mathbf{w}_t) > 0 \\ \mathbf{w}_t & \text{otherwise} \end{cases}$$

C. Algorithm

Our proposed algorithm, Online G-MEAN (OGMEAN), is
given in Algorithm 1. As we can see, OGMEAN requires
only one parameter to tune: $\tau$, the learning rate. In general it
is set to $1/\sqrt{t}$. However, for simplicity we set it to a constant
in our experiments.

Algorithm 1: Online G-mean (OGMEAN) algorithm
  Input: learning rate $\tau$
  Initialization: $\mathbf{w}_1 = \mathbf{0}$
  Output: weight vector $\mathbf{w}_{T+1}$
  for $t = 1, \ldots, T$ do
      receive instance $\mathbf{x}_t$;
      predict $\hat{y}_t = \mathrm{sign}(\mathbf{w}_t \cdot \mathbf{x}_t)$;
      receive correct label $y_t \in \{-1, +1\}$;
      suffer loss $\ell_t(\mathbf{w}_t)$ as given in (5);
      if $\ell_t(\mathbf{w}_t) > 0$ then
          $\mathbf{w}_{t+1} = \mathbf{w}_t + \tau y_t \mathbf{x}_t$;
      else
          $\mathbf{w}_{t+1} = \mathbf{w}_t$;
      end
  end

It is clear that OGMEAN takes time proportional to $O(T \times n)$,
which is linear in both the dimension $n$ of the input
instances and the number $T$ of instances received so far.
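The OGMEAN loop can be sketched as follows. This is an illustrative reading of the algorithm, not the authors' MATLAB implementation. The function name `ogmean_fit`, the tie-breaking rule sign(0) = +1, the choice to update the running counts P, N and FN before computing the margin parameter, and the fallback value of 1 while no positive example has been seen are all assumptions made here. The margin parameter is taken to be N/P for positive examples and (P − FN)/P for negative ones, following Lemma 1.

```python
def ogmean_fit(stream, n_features, tau=0.2):
    """One pass of an OGMEAN-style learner over (x, y) pairs, y in {-1, +1}.

    Returns the final weight vector and the number of updates performed.
    """
    w = [0.0] * n_features   # w_1 = 0
    P = N = FN = 0           # running counts of positives, negatives, false negatives
    n_updates = 0
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1  # assumption: sign(0) = +1

        # Assumption: counts include the current example before rho is computed.
        if y == 1:
            P += 1
            FN += (y_hat == -1)
        else:
            N += 1

        # rho = N/P for positives, (P - FN)/P for negatives.
        # Assumption: use rho = 1 until the first positive example arrives.
        if P == 0:
            rho = 1.0
        elif y == 1:
            rho = N / P
        else:
            rho = (P - FN) / P

        loss = max(0.0, rho - y * score)  # convex surrogate loss
        if loss > 0:
            # Gradient step: w_{t+1} = w_t + tau * y_t * x_t
            w = [wi + tau * y * xi for wi, xi in zip(w, x)]
            n_updates += 1
    return w, n_updates
```

Each round costs one dot product and at most one vector addition, so a pass over T instances of dimension n takes time proportional to T × n. On a toy stream alternating the negative instance (0, 1) and the positive instance (1, 0) with a learning rate of 0.5, four rounds move the weights from (0, 0) to (1, −1).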










TABLE I: Summary of datasets used in the paper

Dataset      # Examples   # Features   # Pos:Neg
covtype      581012       54           1:1
german       1000         24           1:2.3
svmguide3    1243         21           1:3
ijcnn1       141691       22           1:10.44

D. Relative Loss Bound for OGMEAN 

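Following [?], we can bound the regret of the OGMEAN
algorithm. Below we just state the lemma without proof.

Lemma 2: Let $S = \{(\mathbf{x}_t, y_t)\}_{t=1,\ldots,T}$ be the sequence of
$T$ examples where $\mathbf{x}_t \in \mathcal{X}$, $y_t \in \mathcal{Y}$ and $\|\mathbf{x}_t\| \le 1$ for all $t$.
Then for any $\mathbf{w} \in \mathbb{R}^n$, by setting $\tau = \|\mathbf{w}\| / \sqrt{T}$, the following
holds for OGMEAN:

$$\sum_{t=1}^{T} \ell_t(\mathbf{w}_t) \le \sum_{t=1}^{T} \ell_t(\mathbf{w}) + \|\mathbf{w}\| \sqrt{T} \qquad \square$$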
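Although the lemma is stated without proof, the role of the step size can be seen from the standard online gradient descent analysis. The sketch below is not taken from the paper. It assumes each loss is convex in the weights, that the learner starts from a zero weight vector, and that every subgradient has norm at most 1 (which holds here because the subgradient is either zero or the label times the instance, and instances lie in the unit ball).

```latex
\begin{aligned}
\sum_{t=1}^{T} \ell_t(\mathbf{w}_t) - \sum_{t=1}^{T} \ell_t(\mathbf{w})
  &\le \frac{\|\mathbf{w}\|^2}{2\tau}
     + \frac{\tau}{2} \sum_{t=1}^{T} \|\nabla \ell_t(\mathbf{w}_t)\|^2 \\
  &\le \frac{\|\mathbf{w}\|^2}{2\tau} + \frac{\tau T}{2} \\
  &= \frac{\|\mathbf{w}\| \sqrt{T}}{2} + \frac{\|\mathbf{w}\| \sqrt{T}}{2}
   = \|\mathbf{w}\| \sqrt{T}
  \qquad \text{for } \tau = \frac{\|\mathbf{w}\|}{\sqrt{T}}.
\end{aligned}
```

The chosen step size balances the two terms of the generic bound, which is why the regret grows only as the square root of the number of rounds.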

IV. Experiments 

In this section, we experimentally validate the accuracy of
the proposed OGMEAN algorithm over various benchmark
data sets, which can be freely downloaded from the LIBSVM
website^2. A brief summary of the data sets and their class-imbalance
ratios is given in Table I. All the algorithms were run in
MATLAB 2012a (64-bit version) on a 64-bit Windows 8.1
machine. We compare our algorithm with the recently proposed
CSOCsum algorithm [11] with respect to a metric called sum,
which is defined as:

$$sum = \eta_p \times sensitivity + \eta_n \times specificity \tag{13}$$

where $\eta_p$ and $\eta_n$ are weight parameters, which in [11] are set
manually to 0.5 each, and sensitivity and specificity are the
same as recall on positive and negative examples respectively.
It is claimed in [11] that the CSOCsum algorithm beats
state-of-the-art online algorithms for the class-imbalance problem with
respect to the sum metric. The algorithms compared in [11] are
Perceptron [?], ROMMA, agg-ROMMA, PA-I, PA-II, CPApb
[?] and PAUM [20]. Since CSOCsum performs equal to or better
than all the above algorithms, we only compare OGMEAN
to CSOCsum with respect to the sum metric in this paper. We also
show the mistake rate, number of updates and running time
of all the algorithms for fair comparison. For this purpose, we
used the LIBOL online learning library^3.
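As a small illustration of the sum metric in (13), the sketch below computes it from confusion-matrix counts with both weights set to 0.5, the setting reported for the experiments. The function name `weighted_sum` is an assumption made here, and the code assumes both classes are present so that neither denominator is zero.

```python
def weighted_sum(tp, tn, fp, fn, eta_p=0.5, eta_n=0.5):
    """sum = eta_p * sensitivity + eta_n * specificity.

    Assumes at least one positive and one negative example were seen.
    """
    sensitivity = tp / (tp + fn)  # recall on positive examples
    specificity = tn / (tn + fp)  # recall on negative examples
    return eta_p * sensitivity + eta_n * specificity

# A classifier that predicts every example as negative
# (95 negatives, 5 positives) scores only 0.5 on this metric.
print(weighted_sum(tp=0, tn=95, fp=0, fn=5))  # 0.5
```

With equal weights the metric is the average of the two per-class recalls, so ignoring the minority class caps the score at 0.5 regardless of how skewed the class ratio is.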

A. Evaluation of Weighted Sum Performance 

We conducted a comparative study of the CSOCsum and
OGMEAN algorithms, where we set the learning rate parameter $\tau$
to 0.2 and the weights for “sum”, i.e., $\eta_p$ and $\eta_n$, to
0.5 over all the data sets and for both algorithms, as done in
CSOCsum. The online “sum” performance of the two algorithms
is shown in ??????, and Table II over four data sets:
“german”, “ijcnn1”, “svmguide3”, and “covtype”. We can draw
several conclusions from these results. First, we can see in
?????? that OGMEAN achieves an equal or higher “sum” value
compared to CSOCsum over all the data sets. It shows the
potential drawback of using Laplace estimation to estimate
$\rho$ in the CSOCsum algorithm. From the table we infer that
OGMEAN beats the CSOCsum algorithm in terms of mistake rate,
number of updates applied to the weights, and CPU time on almost all
the data sets. It again justifies the potential benefits of using
OGMEAN in real-world applications.

^2 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

^3 http://stevenhoi.org

B. Comparative Study 

We conducted another experiment comparing the mistake
rate, cumulative number of updates and cumulative time cost
over the “covtype” and “german” data sets. For all the algorithms
compared, the cost parameter C is set to 1, except for ALMA, for
which the value of C is $\sqrt{2}$. We kept the values of all other
parameters the same as in the LIBOL implementation. From ??,
we observe that both CSOCsum and OGMEAN have a cumulative
running time approaching that of the perceptron algorithm. In terms
of the number of mistakes, both CSOCsum and OGMEAN
outperform all other algorithms. On the other hand, in ??,
we observe that both CSOCsum and OGMEAN did not do
well in comparison to SCW-I, SCW-II, NHERD, and ALMA
in terms of the number of mistakes. However, CSOCsum and
OGMEAN took less time than all other algorithms
except PERCEPTRON.

In addition, we also tested the comparative performance of all
the algorithms over the other benchmark data sets (“ijcnn1”,
“svmguide3”), where we found that the performance of CW, SCW-I,
and SCW-II was better than that of all other algorithms. However,
the results presented in this paper indicate that the aforementioned
algorithms do not perform as well on “german” and
“covtype” as CSOCsum and OGMEAN. This
leads us to the conclusion that, except for CSOCsum and OGMEAN,
these algorithms lack consistency
in terms of the number of mistakes they make.

V. Conclusion 

In the present work, we tackled binary class-imbalance
learning under the online learning framework. We maximized the
popular Gmean metric for the class-imbalance problem. We showed
that the formulation equivalent to maximizing Gmean is non-convex
and hence used a convex surrogate loss function under the empirical
risk minimization framework. We compared the performance of our
OGMEAN algorithm with the recently proposed CSOCsum
algorithm over various benchmark data sets. It is found that
directly optimizing a weighted sum of sensitivity and specificity,
where the weights are learned using Laplace estimation techniques,
is less efficient than directly optimizing the equivalent
formulation of maximizing Gmean. We also showed the mistake
rate, cumulative time cost and number of updates of our
algorithm with respect to many other online learning algorithms
and concluded that its performance is as good as or better than
these recent online algorithms.

In our future work, we plan to extend the work to the multi-class
setting. Concretely, how can we optimize the Gmean metric for
multiple classes in the online scenario? The problem is that Gmean for
multiple classes is a non-decomposable loss function, which prevents us
from using existing optimization techniques. So, further work
is required in this direction.

Fig. 1: Online “sum” performance of OGMEAN on various data sets
TABLE II: Evaluation of performance of CSOCsum and OGMEAN algorithms with respect to “sum” metric

Data Set     Algorithm   Mistake rate       No. of updates         CPU time (sec)
covtype      CSOCsum     0.2511 ± 0.0003    318603.35 ± 116.12     34.4843 ± 1.0240
             OGMEAN      0.2534 ± 0.0003    315907.00 ± 125.20     33.9553 ± 0.2790
german       CSOCsum     0.3235 ± 0.0087    739.15 ± 5.81          0.0603 ± 0.0036
             OGMEAN      0.3231 ± 0.0082    716.25 ± 4.84          0.0599 ± 0.0037
svmguide3    CSOCsum     0.2800 ± 0.0053    857.65 ± 8.16          0.0722 ± 0.0046
             OGMEAN      0.2750 ± 0.0077    787.90 ± 9.34          0.0724 ± 0.0041
ijcnn1       CSOCsum     0.2742 ± 0.0007    82706.60 ± 119.07      7.8521 ± 0.1987
             OGMEAN      0.2754 ± 0.0008    81877.85 ± 102.83      7.8316 ± 0.0529

Fig. 3: Comparative study of mistake rate, cumulative number of updates and cumulative time cost over the german data set

Fig. 4: Comparative study of mistake rate, cumulative number of updates and cumulative time cost over the covtype data set














































